This notebook demonstrates how to use ML Workbench to create a machine learning model for text classification and set it up for online prediction. The model is trained "locally" inside Datalab. The next notebook (Text Classification --- 20NewsGroup (large data)) demonstrates how to do the same using Cloud ML Engine services.
If you have any feedback, please send it to datalab-feedback@google.com.
The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. The classification problem is to identify the newsgroup a post was submitted to, given the text of the post.
There are a few versions of this dataset from different sources online. Below, we use the version within scikit-learn, which is already split into a train and test/eval set. For a longer introduction to this dataset, see the scikit-learn website.
In [59]:
import numpy as np
import pandas as pd
import os
import re
import csv
from sklearn.datasets import fetch_20newsgroups
In [60]:
# The data will be downloaded. Note that a warning message like "No handlers could be found for
# logger sklearn.datasets.twenty_newsgroups" might be printed, but this is not an error.
news_train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
news_test_data = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
In [61]:
news_train_data.data[2], news_train_data.target_names[news_train_data.target[2]]
Out[61]:
In [62]:
def clean_and_tokenize_text(news_data):
    """Cleans some issues with the text data.
    Args:
        news_data: list of text strings
    Returns:
        For each text string, a list of tokenized words.
    """
    cleaned_text = []
    for text in news_data:
        x = re.sub(r'[^\w]|_', ' ', text)  # keep only letters, numbers, and spaces
        x = x.lower()
        x = re.sub(r'[^\x00-\x7f]', r'', x)  # remove non-ascii characters
        tokens = [y for y in x.split(' ') if y]  # remove empty words
        tokens = ['[number]' if t.isdigit() else t for t in tokens]  # convert all numbers to '[number]' to reduce vocab size
        cleaned_text.append(tokens)
    return cleaned_text
In [63]:
clean_train_tokens = clean_and_tokenize_text(news_train_data.data)
clean_test_tokens = clean_and_tokenize_text(news_test_data.data)
In [64]:
def get_unique_tokens_per_row(text_token_list):
    """Collects unique tokens per row.
    Args:
        text_token_list: list, where each element is a list containing tokenized text
    Returns:
        One list containing the unique tokens of every row. For example, if row one contained
        ['pizza', 'pizza'] while row two contained ['pizza', 'cake', 'cake'], then the output list
        would contain ['pizza' (from row 1), 'pizza' (from row 2), 'cake' (from row 2)].
    """
    words = []
    for row in text_token_list:
        words.extend(set(row))
    return words
In [65]:
# Make a plot where the x-axis is a token, and the y-axis is how many text documents
# that token is in.
words = pd.DataFrame(get_unique_tokens_per_row(clean_train_tokens), columns=['words'])
token_frequency = words['words'].value_counts() # how many documents contain each token.
token_frequency.plot(logy=True)
Out[65]:
In [66]:
# Keep tokens that appear in more than 10 but fewer than 1000 documents.
vocab = token_frequency[np.logical_and(token_frequency < 1000, token_frequency > 10)]
vocab.plot(logy=True)
Out[66]:
In [67]:
def filter_text_by_vocab(news_data, vocab):
    """Removes tokens that are not in the vocab.
    Args:
        news_data: list, where each element is a token list
        vocab: set containing the tokens to keep
    Returns:
        List of strings containing the final cleaned text data
    """
    text_strs = []
    for row in news_data:
        words_to_keep = [token for token in row if token in vocab or token == '[number]']
        text_strs.append(' '.join(words_to_keep))
    return text_strs
In [68]:
clean_train_data = filter_text_by_vocab(clean_train_tokens, set(vocab.index))
clean_test_data = filter_text_by_vocab(clean_test_tokens, set(vocab.index))
In [69]:
# Check a few instances of cleaned data
clean_train_data[:3]
Out[69]:
In [70]:
!mkdir -p ./data
with open('./data/train.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_train_data.target, clean_train_data):
        writer.writerow([news_train_data.target_names[target], text])
with open('./data/eval.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_test_data.target, clean_test_data):
        writer.writerow([news_test_data.target_names[target], text])
# Also save the vocab, which will be useful for making new predictions.
with open('./data/vocab.txt', 'w') as f:
    vocab.to_csv(f)
The MLWorkbench Magics are a set of Datalab commands that provide an easy, nearly code-free experience for training, deploying, and predicting with ML models. This notebook takes the cleaned data from above and builds a text classification model. The MLWorkbench Magics are a collection of magic commands, one for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.
For details of each command, run it with --help. For example, "%%ml train --help".
When the dataset is small (like the 20 newsgroups data), there is little benefit to using cloud services. This notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).
The next notebook (Text Classification --- 20NewsGroup (large data)) in this sequence shows the cloud version of every command and reflects the typical experience of building models on large datasets, although it still uses the 20 newsgroups data.
In [71]:
import google.datalab.contrib.mlworkbench.commands # This loads the '%%ml' magics
First, define the dataset we are going to use for training.
In [72]:
%%ml dataset create
name: newsgroup_data
format: csv
train: ./data/train.csv
eval: ./data/eval.csv
schema:
  - name: news_label
    type: STRING
  - name: text
    type: STRING
In [73]:
%%ml dataset explore
name: newsgroup_data
The first step in the MLWorkbench workflow is to analyze the data for the requested transformations. We are going to build a bag-of-words representation of the text and use it in a linear model, so the analyze step computes the vocabulary and related statistics of the training data.
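For intuition, a bag of words simply counts how often each vocabulary token occurs in a document. Below is a minimal sketch of the idea using scikit-learn's CountVectorizer; this is only an illustration of the representation, not what "%%ml analyze" runs internally.
In [ ]:
# Illustration only: a bag-of-words encoding turns each document into a vector
# of token counts. %%ml analyze builds its own vocabulary and statistics, so
# treat this CountVectorizer sketch as a conceptual analogy.
from sklearn.feature_extraction.text import CountVectorizer

docs = ['nasa launched a rocket', 'windows crashed again', 'nasa windows patch']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix
print(vectorizer.get_feature_names())    # the learned vocabulary
print(counts.toarray())                  # one row of token counts per document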
In [74]:
%%ml analyze
output: ./analysis
data: newsgroup_data
features:
  news_label:
    transform: target
  text:
    transform: bag_of_words
In [75]:
!ls ./analysis
This step is optional, as training can start from csv data (the same data used in the analysis step). The transform step applies the transformations to the input data and saves the results in a special TensorFlow file called a TFRecord file, which contains tf.Example protocol buffers. This allows training to start from preprocessed data. If this step is skipped, training has to perform the same preprocessing on every row of csv data each time it reads it. Because TensorFlow reads the same data rows multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training. Since the 20 newsgroups data is small, this step does not matter much, but we do it anyway for illustration. The transform step is recommended if a dataset has text columns, and required if it has image columns.
We run the transform step for both the training and eval data.
In [76]:
!rm -rf ./transform
In [77]:
%%ml transform --shuffle
output: ./transform
analysis: ./analysis
data: newsgroup_data
In [78]:
# Note: the errors_* files all have size 0, which means there were no errors.
!ls ./transform/ -l -h
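If you are curious about what the transform step wrote, each record is a tf.Example protocol buffer inside a TFRecord file. Below is a minimal sketch for decoding one record with the TF 1.x API; the GZIP compression setting is an assumption about how these files are written, so drop the options argument if they turn out to be uncompressed.
In [ ]:
# Sanity check: decode the first tf.Example from the transform output.
# Assumption: the output files are gzipped TFRecords matching ./transform/train-*;
# if they are uncompressed, drop the options argument.
import glob
import tensorflow as tf

path = glob.glob('./transform/train-*')[0]
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator(path, options=options):
    print(tf.train.Example.FromString(record))  # the transformed features of one row
    break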
Create a "transformed dataset" to use in the next step.
In [79]:
%%ml dataset create
name: newsgroup_transformed
train: ./transform/train-*
eval: ./transform/eval-*
format: transformed
In [80]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!rm -fr ./train
The following training step takes about 10 to 15 minutes.
In [81]:
%%ml train
output: ./train
analysis: ./analysis/
data: newsgroup_transformed
model_args:
  model: linear_classification
  top-n: 5
Go to TensorBoard (link shown above) to monitor training progress. Note that training stops when it detects that accuracy on the eval data is no longer increasing.
In [82]:
# You can also plot the summary events which will be saved with the notebook.
from google.datalab.ml import Summary
summary = Summary('./train')
summary.list_events()
Out[82]:
In [83]:
summary.plot(['loss', 'accuracy'])
The output of training is two models, one in ./train/model and another in ./train/evaluation_model. These TensorFlow models are identical except that the latter assumes the target column is part of the input and copies the target value to the output. The latter is therefore ideal for evaluation.
In [84]:
!ls ./train/
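To see how the two exports differ, you can inspect their serving signatures. Below is a minimal sketch using the TF 1.x SavedModel loader; it assumes both directories are standard SavedModels exported with the 'serve' tag and a 'serving_default' signature.
In [ ]:
# Compare the serving signatures of the two exported models.
# Assumption: both directories are TF 1.x SavedModels with the 'serve' tag
# and a 'serving_default' signature.
import tensorflow as tf

for model_dir in ['./train/model', './train/evaluation_model']:
    with tf.Session(graph=tf.Graph()) as sess:
        meta = tf.saved_model.loader.load(sess, ['serve'], model_dir)
        signature = meta.signature_def['serving_default']
        print(model_dir)
        print('  inputs: ', list(signature.inputs.keys()))
        print('  outputs:', list(signature.outputs.keys()))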
In [85]:
%%ml batch_predict
model: ./train/evaluation_model/
output: ./batch_predict
format: csv
data:
  csv: ./data/eval.csv
In [86]:
# It creates a results csv file, and a results schema json file.
!ls ./batch_predict
Note that the output of prediction is a csv file containing the score for each label class. 'predicted_n' is the label with the nth largest score. We care most about 'predicted', the final model prediction.
In [87]:
!head -n 5 ./batch_predict/predict_results_eval.csv
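The same file can also be sliced with pandas. As a quick cross-check of the evaluate commands below, here is a manual accuracy computation; it assumes the header row shown above contains 'predicted' and 'target' columns.
In [ ]:
# Quick manual check of the batch prediction output with pandas.
# Assumption: the results csv has a header row with 'predicted' and 'target'
# columns, as the head output above suggests.
results = pd.read_csv('./batch_predict/predict_results_eval.csv')
print('accuracy: %.4f' % (results['predicted'] == results['target']).mean())
results[results['predicted'] != results['target']].head()  # a few wrong predictions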
In [88]:
%%ml evaluate confusion_matrix --plot
csv: ./batch_predict/predict_results_eval.csv
In [89]:
%%ml evaluate accuracy
csv: ./batch_predict/predict_results_eval.csv
Out[89]:
In [90]:
# Create bucket
!gsutil mb gs://bq-mlworkbench-20news-lab
!gsutil cp -r ./batch_predict/predict_results_eval.csv gs://bq-mlworkbench-20news-lab
In [91]:
# Use Datalab's BigQuery API to load the CSV file into a table.
import google.datalab.bigquery as bq
import json
with open('./batch_predict/predict_results_schema.json', 'r') as f:
    schema = json.load(f)
# Create the BQ dataset.
bq.Dataset('newspredict').create()
# Create the table and load the results.
table = bq.Table('newspredict.result1').create(schema=schema, overwrite=True)
table.load('gs://bq-mlworkbench-20news-lab/predict_results_eval.csv', mode='overwrite',
           source_format='csv', csv_options=bq.CSVOptions(skip_leading_rows=1))
Out[91]:
Now you can run any SQL query against the table newspredict.result1. Below we query all wrong predictions.
In [92]:
%%bq query
SELECT * FROM newspredict.result1 WHERE predicted != target
Out[92]:
In [93]:
%%ml predict
model: ./train/model/
headers: text
data:
  - nasa
  - windows xp
"%%ml explain" gives you insights on what are important features in the prediction data that contribute positively or negatively to certain labels. We use LIME under "%%ml explain". (LIME is an open sourced library performing feature sensitivity analysis. It is based on the work presented in this paper. LIME is included in Datalab.)
In this case, we will check which words in text are contributing most to the predicted label.
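For reference, the sketch below shows roughly how LIME's text explainer is invoked directly. The predict_fn here is a hypothetical placeholder for a function mapping a list of raw strings to an array of class probabilities; "%%ml explain" wires the real model up for you.
In [ ]:
# Conceptual sketch of what %%ml explain does with LIME under the hood.
# 'predict_fn' is a hypothetical placeholder: any callable that takes a list
# of strings and returns an (n_samples, n_classes) array of probabilities.
from lime.lime_text import LimeTextExplainer

def predict_fn(texts):
    return np.tile([0.9, 0.1], (len(texts), 1))  # placeholder probabilities

explainer = LimeTextExplainer(class_names=['rec.autos', 'comp.windows.x'])
explanation = explainer.explain_instance('nasa launched the shuttle', predict_fn, num_features=5)
print(explanation.as_list())  # (word, weight) pairs for the explained label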
In [94]:
# Pick some instances from the eval csv file. They are already cleaned text.
# The truth labels for the following 3 instances are
# - rec.autos
# - comp.windows.x
# - talk.politics.mideast
instance0 = ('little confused models [number] [number] heard le se someone tell differences far features ' +
'performance curious book value [number] model less book value usually words demand ' +
'year heard mid spring early summer best buy')
instance1 = ('hi requirement closing opening different display servers within x application manner display ' +
'associated client proper done during transition problems')
instance2 = ('attacking drive kuwait country whose citizens close blood business ties saudi citizens thinks ' +
'helped saudi arabia least eastern muslim country doing anything help kuwait protect saudi arabia ' +
'indeed masses citizens demonstrating favor butcher saddam killed muslims killing relatively rich ' +
'muslims nose west saudi arabia rolled iraqi invasion charge saudi arabia idea governments official ' +
'religion de facto de human nature always ones rise power world country citizens leader slick ' +
'operator sound guys angels posting edited stuff following friday york times reported group definitely ' +
'conservative followers house rule country enough reported besides complaining government conservative ' +
'enough asserted approx [number] [number] kingdom charge under saudi islamic law brings death penalty ' +
'diplomatic guy bin isn called severe punishment [number] women drove public while protest ban women ' +
'driving guy group said al said women fired jobs happen heard muslims ban women driving basis qur etc ' +
'yet folks ban women called choose rally behind hate women allowed tv radio immoral kingdom house neither ' +
'least nor favorite government earth restrict religious political lot among things likely replacements ' +
'going lot worse citizens country house feeling heat lately last six months read religious police ' +
'government western women fully stupid women imo sends wrong signals morality read cracked down few home ' +
'based religious posted government owned newspapers offering money turns group dare worship homes secret ' +
'place government grown try take wind conservative opposition things small taste happen guys house trying ' +
'long run others general west evil zionists rule hate west crowd')
data = [instance0, instance1, instance2]
In [95]:
%%ml predict
model: ./train/model/
headers: text
data: $data
The first and second instances are predicted correctly. The third is wrong. Below we run "%%ml explain" to understand more.
In [96]:
%%ml explain --detailview_only
model: ./train/model
labels: rec.autos
type: text
data: $instance0
In [97]:
%%ml explain --detailview_only
model: ./train/model
labels: comp.windows.x
type: text
data: $instance1
On instance 2, the top prediction does not match the truth: the predicted label is "talk.politics.guns" while the truth is "talk.politics.mideast". So let's analyze these two labels.
In [98]:
%%ml explain --detailview_only
model: ./train/model
labels: talk.politics.guns,talk.politics.mideast
type: text
data: $instance2
Now that we have a trained model, analyzed the results, and tested the model locally, we are ready to deploy it to the cloud for real predictions.
Deploying a model requires the files to be on GCS. The next few cells make a bucket on GCS, copy the locally trained model to it, and deploy it.
In [99]:
!gsutil -q mb gs://bq-mlworkbench-20news-lab
In [100]:
# Move the regular model to GCS
!gsutil -m cp -r ./train/model gs://bq-mlworkbench-20news-lab
See https://cloud.google.com/ml-engine/docs/how-tos/managing-models-jobs for the definition of ML Engine models and versions. An ML Engine version runs predictions and is contained in an ML Engine model. We will create a new ML Engine model and deploy the TensorFlow graph as an ML Engine version. This can be done with gcloud (see https://cloud.google.com/ml-engine/docs/how-tos/deploying-models) or with Datalab, which we use below.
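For reference, the gcloud equivalent of the deploy command below looks roughly like this; the region, runtime version, and exact model directory are assumptions you would adjust for your project, and you should run either this or the magic below, not both.
In [ ]:
# Rough gcloud equivalent of '%%ml model deploy' below (do not run both).
# The region, runtime version, and model directory are assumptions to adjust.
!gcloud ml-engine models create news --regions us-central1
!gcloud ml-engine versions create alpha --model news \
    --origin gs://bq-mlworkbench-20news-lab/model --runtime-version 1.2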
In [101]:
%%ml model deploy
path: gs://bq-mlworkbench-20news-lab
name: news.alpha
A common task is to call a deployed model from different applications. Below is an example of a Python client that runs prediction.
Covering model permissions is outside the scope of this notebook, but for more information see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials.
In [102]:
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')
# Get application default credentials (possible only if the gcloud tool is
# configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
# for more info.
credentials = GoogleCredentials.get_application_default()
# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)
# Create a dictionary containing the data to predict.
# Note that the data is a list of csv strings.
body = {'instances': ['nasa', 'windows xp']}
# Create the request.
request = ml.projects().predict(name=api_path, body=body)
print('The JSON request: \n')
print(request.to_json())
# Make the call.
try:
    response = request.execute()
    print('\nThe response:\n')
    print(json.dumps(response, indent=2))
except errors.HttpError as err:
    # Something went wrong; print out some information.
    print('There was an error. Check the details:')
    print(err._get_reason())
To explore the prediction client further, check the API Explorer (https://developers.google.com/apis-explorer). It allows you to send raw HTTP requests to many Google APIs. This is useful for understanding the requests and responses, and it can help you build your own client in your favorite language.
Please visit https://developers.google.com/apis-explorer/#search/ml%20engine/ml/v1/ml.projects.predict and enter the following values in each text box.
In [103]:
# The output of this cell is placed in the name box.
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')
print('Place the following in the name box')
print(api_path)
The fields text box can be empty.
Note that because we deployed the non-evaluation model, the deployed model takes csv input with only one column. In general, "instances" is a list of csv strings for models trained by MLWorkbench.
Click in the request body box and note that a small drop-down menu appears at the far right of the input box. Select "Freeform editor". Then enter the following in the request body box.
In [104]:
print('Place the following in the request body box')
request = {'instances': ['nasa', 'windows xp']}
print(json.dumps(request))
Then click the "Authorize and execute" button. The prediction results are returned in the browser.
In [105]:
%%ml model delete
name: news.alpha
In [ ]:
%%ml model delete
name: news
In [107]:
# Delete the GCS bucket
!gsutil -m rm -r gs://bq-mlworkbench-20news-lab
In [108]:
# Delete BQ table
bq.Dataset('newspredict').delete(delete_contents = True)